A neural encoder for earthquake rate forecasting

Forecasting the timing of earthquakes is a long-standing challenge. Moreover, it is still debated how to formulate this problem in a useful manner, or how to compare the predictive power of different models. Here, we develop a versatile neural encoder of earthquake catalogs, and apply it to the fundamental problem of earthquake rate prediction, in the spatio-temporal point process framework. The epidemic type aftershock sequence model (ETAS) effectively learns a small number of parameters to constrain the assumed functional forms for the space and time correlations of earthquake sequences (e.g., the Omori-Utsu law). Here we introduce learned spatial and temporal embeddings for point process earthquake forecasting models that capture complex correlation structures. We demonstrate the generality of this neural representation as compared with the ETAS model using train-test data splits, and how it enables the incorporation of additional geophysical information. In rate prediction tasks, the generalized model shows a >4% improvement in information gain per earthquake and the simultaneous learning of anisotropic spatial structures analogous to fault traces. The trained network can also be used to perform short-term prediction tasks, showing similar improvement while providing a 1000-fold reduction in run-time.

As a benchmark, we used the FORTRAN implementation of the ETAS model published by Zhuang [1][2][3][4]. The model writes the conditional intensity as

λ(x, y, t) = µ(x, y, t) + Σ_{i : t_i < t} f(x, y, t, x_i, y_i, z_i, t_i, M_i),    (1)

where x, y, t are the space and time coordinates at which the rate is evaluated, and x_i, y_i, t_i, z_i, M_i are the cataloged space and time coordinates, depth, and magnitude of past earthquakes, respectively. The sum is restricted to events that occurred prior to the evaluation time t.
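For concreteness, below is a minimal sketch of evaluating an ETAS-style conditional intensity, assuming an exponential productivity law, Omori-Utsu temporal decay, and an isotropic Gaussian spatial kernel; the parameter names and default values are illustrative placeholders, not the fitted values of Zhuang's implementation.

```python
import numpy as np

def etas_intensity(x, y, t, catalog, mu=1e-5, K=0.05, alpha=1.5,
                   c=0.01, p=1.1, d=1.0, M0=3.0):
    """Illustrative ETAS-style conditional intensity lambda(x, y, t).

    catalog: array of shape (n, 4) with columns (x_i, y_i, t_i, M_i).
    All parameters are placeholders, not fitted values.
    """
    past = catalog[catalog[:, 2] < t]              # only events prior to t
    dx, dy = x - past[:, 0], y - past[:, 1]
    dt = t - past[:, 2]
    productivity = K * np.exp(alpha * (past[:, 3] - M0))   # magnitude term
    temporal = (p - 1) / c * (1 + dt / c) ** (-p)          # Omori-Utsu decay
    spatial = np.exp(-(dx**2 + dy**2) / (2 * d**2)) / (2 * np.pi * d**2)
    return mu + np.sum(productivity * temporal * spatial)
```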

II. DESCRIPTION OF THE ENCODERS
The catalog encoder consists of two separate models, each capturing different properties of earthquake distributions. The first model, described in Section II A, summarizes the contributions of recent earthquakes to the seismicity in the studied domain. It can be thought of as a direct generalization of the f function in Equation (1). The second model, in contrast, describes the seismicity (i.e. the average rate of earthquakes in the past day, month, or year) in the studied domain. It is described in Section II B. Additionally, we introduce in Section II C an encoder that does not directly encode earthquakes, but rather can learn location-dependent weights. It can be thought of as an approximation of the background rate function µ in Equation (1).
The models below are structured as follows. Define some study region A, and a time range [T_0, T_1]. Given a location (x, y) ∈ A, a time t ∈ [T_0, T_1] and a catalog of earthquakes Ω, every model is some function E(x, y, t, Ω) → R^n (e.g. the long-term seismicity model will return a vector of average rates of earthquakes for different regions around (x, y) and time ranges before t).
The models have two stages - a feature calculation stage and an encoding stage. The feature calculation is a human-engineered function F(x, y, t, Ω) → R^e of real values, named "features". These features represent values that are known to be relevant to the distribution of earthquakes, e.g. one feature we use below is the reciprocal of the time that elapsed since previous earthquakes. The full list of features is given in Section III A.
The encoding stage is some (trainable) machine-learned model M(v) → R^n. So the final encoding function is E(x, y, t, Ω) = M(F(x, y, t, Ω)). The architecture of the model can easily be changed in the future, e.g. by using a Recurrent Neural Network (RNN) if we wish to capture some temporal connection between the features, or a Convolutional Neural Network (CNN) if we suspect that there is some relationship between spatially adjacent features. Here we use multilayer perceptrons, described in Section III A. The outputs of these models are concatenated, and then used as an input to an inference model. This approach is very versatile - we can add more encoding models, change some of the feature calculations, or replace the architectures - all depending on the task that we want to solve. Simultaneously, we can transfer trained encoders to solve related tasks, as we describe in the main text.
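As a minimal illustration of this two-stage structure, the sketch below composes a fixed feature-engineering function F with a small trainable multilayer perceptron M in PyTorch. The class name, layer sizes, and the two placeholder features are our own choices for the sketch, not the published architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Two-stage encoder E(x, y, t, catalog) = M(F(x, y, t, catalog)):
    a fixed feature stage followed by a trainable MLP."""

    def __init__(self, n_features: int, n_out: int):
        super().__init__()
        self.mlp = nn.Sequential(          # the trainable stage M
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, n_out),
        )

    @staticmethod
    def features(x, y, t, catalog):
        # F: hand-engineered features; here, two placeholder examples --
        # mean reciprocal elapsed time and mean exponentiated magnitude.
        dt = t - catalog[:, 2]             # catalog columns: (x, y, t, M)
        return torch.stack([(1.0 / dt).mean(), torch.exp(catalog[:, 3]).mean()])

    def forward(self, x, y, t, catalog):
        return self.mlp(self.features(x, y, t, catalog))
```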
The source code is available at [5].

A. Recent Earthquakes encoder
The goal of this encoder model is to generalize the f function in Equation (1). In other words, we want to be able to express a sum of some contribution from every earthquake in the (relatively recent) past. We limit ourselves to some N earthquakes in the past (N can be fairly large, covering, on average, years before t).
The feature engineering function F_recent(x, y, t, Ω) in this case outputs a matrix in R^{e×N}. Each column corresponds to a single earthquake (out of the past N earthquakes). Denote the location, depth, time and magnitude of a past earthquake by x_j, y_j, z_j, t_j and M_j, respectively, 1 ≤ j ≤ N. We choose some e functions f_i(x, y, t, x_j, y_j, z_j, t_j, M_j) → R for 1 ≤ i ≤ e, that calculate values that are known to be useful for estimating the target output. Every row in the matrix corresponds to a function, meaning that R_{i,j} = f_i(x, y, t, x_j, y_j, z_j, t_j, M_j). Unlike the ETAS f, these are non-parametric functions - the parameterization is replaced with the encoding stage. Specifically, for the application in Section III, the feature functions are the building blocks of the ETAS f function. They include the exponent of the magnitude, f_1 = e^{M_j}, the reciprocal of elapsed times, f_2 = (t − t_j)^{−1}, etc. The full list appears in Section III A. In such applications there is usually little harm in adding more feature functions and testing whether they improve the performance of the model.
The encoder we choose is invariant to the order of the input earthquakes. Firstly, this closely mimics the behavior of the sum in ETAS. Additionally, this has the advantage that a small change in the input earthquakes won't change the encoding significantly. We achieve this order invariance by using a series of convolutional layers with shapes (1, e_j) and e_{j+1} filters per layer (here e_0 = e), and then summing along the earthquake axis.
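A sketch of this order-invariant design follows: a shared per-earthquake transformation (implemented here with kernel-size-1 convolutions over the earthquake axis, which is equivalent to the (1, e_j)-shaped convolutions described above) followed by a sum over earthquakes. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class RecentEarthquakesEncoder(nn.Module):
    """Order-invariant encoder: a shared per-earthquake transformation
    followed by a sum over the earthquake axis. Sizes are illustrative."""

    def __init__(self, e: int, hidden=(32, 32), n_out: int = 16):
        super().__init__()
        layers, prev = [], e
        for h in hidden:
            layers += [nn.Conv1d(prev, h, kernel_size=1), nn.ReLU()]
            prev = h
        layers.append(nn.Conv1d(prev, n_out, kernel_size=1))
        self.shared = nn.Sequential(*layers)

    def forward(self, features):            # features: (batch, e, N)
        per_quake = self.shared(features)   # (batch, n_out, N)
        return per_quake.sum(dim=2)         # summing makes the output order-invariant

# Permuting the N input earthquakes leaves the encoding unchanged:
enc = RecentEarthquakesEncoder(e=4)
feats = torch.randn(1, 4, 100)
perm = feats[:, :, torch.randperm(100)]
assert torch.allclose(enc(feats), enc(perm), atol=1e-5)
```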

B. Seismicity Rate encoder
Here the goal is to capture long- and short-term seismicity in the study region - the average rate of earthquakes per day or per month, before t and around (x, y). Intuitively, if the region around (x, y) has recently seen a lot of seismicity, it is natural to predict that there will be a lot of seismic activity there in the near future as well.
FIG. 1. Schematic of the Recent Earthquakes encoder. We encode each earthquake separately, using some feature engineering functions. We then apply a trainable function on every encoded earthquake, and sum the results.

FIG. 2. Schematic of the Seismicity encoder. We calculate average rates of earthquakes for different time ranges, magnitude thresholds, and regions around the input point. We then apply a trainable function that uses weight sharing in order to infer the impact of rare (large) earthquakes from frequent (smaller) earthquakes.

The straightforward way to build features for this model is as follows. Define a set of time look-back windows {b_i}_{i=1}^{B}, a set of spatial window sizes {d_j}_{j=1}^{D}, and a set of magnitude thresholds {m_k}_{k=1}^{M}. Then, given an input space-time point (x, y, t), for every 1 ≤ i ≤ B, 1 ≤ j ≤ D and 1 ≤ k ≤ M, calculate n_{i,j,k}(x, y, t) - the number of earthquakes in the b_i seconds prior to t that are of magnitude at least m_k and are in the (axis-aligned) square with side d_j meters centered at (x, y).
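A direct sketch of computing the counts n_{i,j,k} is given below; the function and argument names are ours, and the catalog layout (columns x, y, t, magnitude) is assumed for illustration.

```python
import numpy as np

def count_features(x, y, t, catalog, lookbacks, sides, mag_thresholds):
    """n_{i,j,k}(x, y, t): number of earthquakes in the b_i seconds before t,
    of magnitude >= m_k, inside an axis-aligned square of side d_j centered
    at (x, y). catalog columns: (x, y, t, magnitude)."""
    n = np.zeros((len(lookbacks), len(sides), len(mag_thresholds)))
    for i, b in enumerate(lookbacks):
        for j, d in enumerate(sides):
            for k, m in enumerate(mag_thresholds):
                mask = ((t - b <= catalog[:, 2]) & (catalog[:, 2] < t)
                        & (np.abs(catalog[:, 0] - x) <= d / 2)
                        & (np.abs(catalog[:, 1] - y) <= d / 2)
                        & (catalog[:, 3] >= m))
                n[i, j, k] = mask.sum()
    return n
```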
We could then take a vector of all n_{i,j,k}(x, y, t) and use some model to encode them. However, note what happens for small values of b_i, d_j, and large values of m_k. For such values, it is possible that we do not see many examples of (x, y, t) in the training set for which n_{i,j,k}(x, y, t) > 0. However, we know that if n_{i,j,k}(x, y, t) > 0 for some small b_i, d_j, and a large m_k, then it greatly increases the future rate of earthquakes near (x, y). In order to help the model extrapolate from values of b_i, d_j, m_k that often have n_{i,j,k}(x, y, t) > 0 to ones where n_{i,j,k} tends to be low (or 0), we add more features and a specific encoder.
The feature engineering function F_seismicity(x, y, t, Ω) outputs a matrix S ∈ R^{u×(B·D·M)}. Each column corresponds to a single choice of b_i, d_j and m_k. The first row holds the values n_{i,j,k}(x, y, t). From the second row onward the values are functions g_s(b_i, d_j, m_k) for 1 ≤ s ≤ u − 1. For instance, the second row can hold the exponent of m_k (the one that corresponds to the column), the third row can hold the reciprocal of b_i, and so on. Again, the functions are building blocks inspired by ETAS, and there is little harm in adding more feature functions. The full list of feature functions appears in Section III A.
In order to infer from some values of b_i, d_j, m_k to others, we use a similar architecture to the one used in Section II A. We start by applying some r convolutional layers with shapes (1, u_j) and u_{j+1} filters per layer (here u_0 = u and 0 ≤ j ≤ r − 1). Thus we learn a set of u_r functions h_l(x, y, t, n_{i,j,k}(x, y, t), b_i, d_j, m_k), 1 ≤ l ≤ u_r. Note that the functions are shared across different values of b_i, d_j, m_k, even those where n_{i,j,k}(x, y, t) is often equal to 0. We then flatten the resulting u_r × (B·D·M) matrix into a single vector, and apply some fully connected layers to combine these functions.
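The sketch below illustrates this weight sharing: the same small network is applied to every (b_i, d_j, m_k) column, and the flattened outputs are combined by fully connected layers. All sizes are illustrative choices, not the published ones.

```python
import torch
import torch.nn as nn

class SeismicityEncoder(nn.Module):
    """Shared per-column transformation of the u x (B*D*M) feature matrix,
    followed by flattening and fully connected layers."""

    def __init__(self, u: int, n_columns: int, shared=(32, 8), n_out: int = 16):
        super().__init__()
        layers, prev = [], u
        for h in shared:
            layers += [nn.Conv1d(prev, h, kernel_size=1), nn.ReLU()]
            prev = h
        self.shared = nn.Sequential(*layers)   # same weights for every (b, d, m)
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(prev * n_columns, 64), nn.ReLU(),
            nn.Linear(64, n_out),
        )

    def forward(self, features):               # features: (batch, u, B*D*M)
        return self.head(self.shared(features))
```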

C. Location encoder
The final encoder attempts to capture the general spatial correlation between the location (x, y) of the example and the target output. We divide the study region A into a regularly spaced grid and represent the location (x, y) with a 1-hot encoding of the grid cell that contains (x, y). The encoding stage is simply a single fully connected layer applied on the 1-hot encoding.
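A minimal sketch of this location encoder, assuming a rectangular study region with an nx × ny grid (bounds, grid size and output size are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationEncoder(nn.Module):
    """1-hot encoding of the grid cell containing (x, y), followed by a
    single fully connected layer."""

    def __init__(self, x_min, x_max, y_min, y_max, nx: int, ny: int, n_out: int = 8):
        super().__init__()
        self.bounds = (x_min, x_max, y_min, y_max)
        self.nx, self.ny = nx, ny
        self.linear = nn.Linear(nx * ny, n_out)

    def forward(self, x, y):                   # x, y: float tensors of shape (batch,)
        x_min, x_max, y_min, y_max = self.bounds
        ix = ((x - x_min) / (x_max - x_min) * self.nx).long().clamp(0, self.nx - 1)
        iy = ((y - y_min) / (y_max - y_min) * self.ny).long().clamp(0, self.ny - 1)
        one_hot = F.one_hot(ix * self.ny + iy, self.nx * self.ny).float()
        return self.linear(one_hot)
```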

III. RATE PREDICTION
Our first goal is to build a model that outputs the conditional intensity function λ(x, y, t|H_t) of earthquakes in A above some threshold magnitude M_0, similarly to the output of ETAS. We call this model FERN (Forecasting Earthquake Rates with Neural nets). FERN is a parametric model, and we sometimes write FERN_θ to emphasize that θ is some assignment to the various weights in the model (both in the encoders that were described above, and in the head model described below).
The output of ETAS is notoriously difficult to integrate numerically. Assuming that our model will produce a similar output, we wish to avoid numerical integration when calculating the log-likelihood loss. We therefore follow the architecture in [6], by introducing a 'lead time' ∆t and outputting a quantity that accumulates the intensity over the lead time. Precisely as described in [6], we can calculate λ(x, y, t|H_{t−}) by using back-propagation to the input ∆t. Also following [6], the head model is a series of fully connected layers with monotonically increasing activation functions (tanh) and non-negative weights - thus we achieve that the back-propagation of the output to the input ∆t is non-negative.
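The sketch below illustrates this mechanism under our own simplifications: non-negative weights are enforced by taking absolute values of the parameters, and the intensity is recovered by differentiating the output with respect to ∆t via autograd. The exact head of [6] differs in its details.

```python
import torch
import torch.nn as nn

class MonotoneHead(nn.Module):
    """Head that is non-decreasing in the lead time dt: tanh activations and
    non-negative weights (enforced here by abs()); the intensity is then the
    derivative of the output with respect to dt, obtained via autograd."""

    def __init__(self, n_latent: int, hidden: int = 64):
        super().__init__()
        self.w1 = nn.Parameter(0.1 * torch.rand(hidden, n_latent + 1))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(0.1 * torch.rand(1, hidden))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, latent, dt):
        inp = torch.cat([latent, dt], dim=-1)      # (batch, n_latent + 1)
        h = torch.tanh(inp @ self.w1.abs().T + self.b1)
        return h @ self.w2.abs().T + self.b2       # non-decreasing in dt

head = MonotoneHead(n_latent=16)
latent = torch.randn(1, 16)
dt = torch.ones(1, 1, requires_grad=True)
output = head(latent, dt)
lam, = torch.autograd.grad(output.sum(), dt)       # lambda >= 0 by construction
```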
The loss function L that we optimize for is (an approximation of) the negative of the log-likelihood of the observed catalog under the conditional intensity function that is produced by the model FERN_θ(x, y, t, ∆t).
Define (n + 1) · q training examples, where q is some constant (and n is the number of target earthquakes in the training set, as before). The examples are denoted by e_{i,j}, for 1 ≤ i ≤ n + 1 and 1 ≤ j ≤ q. In order to calculate this loss during training, each example e_{i,j} is responsible for calculating a part of this formula - the loss of the example L(e_{i,j}|θ). As shown below, the sum of the contributions of all examples L(e_{i,j}|θ) approximates the log-likelihood L(θ).
We construct the examples as follows. Extend the series {t_i}_{i=1}^{n} by adding t_0 = T_0 and t_{n+1} = T_1. For every example e_{i,j}, define the 'history time' h_i = t_{i−1}, 'lead time' ∆t_i = t_i − t_{i−1}, and 'location' (a_{i,j}, b_{i,j}) ∈ A. Set (a_{i,1}, b_{i,1}) = (x_i, y_i). For 2 ≤ j ≤ q, the locations (a_{i,j}, b_{i,j}) can be arbitrary, but their choice can impact the accuracy of the approximation of L(θ). In our implementation we choose the (a_{i,j}, b_{i,j}) (apart from (x_i, y_i)) to be evenly spaced on a regular grid in A.
We can define the loss of each example, L(e_{i,j}|θ), in two cases: the example e_{i,1} (for i ≤ n) contributes the log-intensity term at the observed earthquake, and every example contributes a term approximating the space-time integral of the intensity. Now, we assume that, as in ETAS, λ(x_i, y_i, t_i|H_{t_i−}) = λ(x_i, y_i, t_i|H_{t_{i−1}}) (meaning that the events that happen since the last target earthquake do not affect the output). Note therefore that if we sum L(e_{i,j}|θ) over all examples e_{i,j}, we get exactly Σ_{i=1}^{n} log λ(x_i, y_i, t_i|H_{t_i−}) from the first case. As for the integral part, when summing L(e_{i,j}|θ) over all 1 ≤ j ≤ q, we get an approximation of −∫_{t_{i−1}}^{t_i} ∫_A λ(x, y, t|H_{t−}) dx dy dt. Summing these integrals for 1 ≤ i ≤ n + 1 gives us the desired approximation of L(θ).
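The spatial part of this approximation can be viewed as a simple quadrature over the q locations; the sketch below assumes each location is weighted by |A|/q, which is our choice for the illustration rather than a detail taken from the paper.

```python
import numpy as np

def spatial_integral(intensity_fn, area, grid_points):
    """Approximate the integral over A of lambda(x, y) dx dy using q evenly
    spaced locations (a_j, b_j), as in the construction of the examples
    e_{i,j}. 'area' is |A|; each point is weighted by |A|/q (a simple
    Riemann-sum sketch)."""
    q = len(grid_points)
    return area / q * sum(intensity_fn(a, b) for a, b in grid_points)

# Example: a constant intensity of 2.0 over a region of area 5.0 integrates to ~10.0
points = [(x, y) for x in np.linspace(0, 1, 10) for y in np.linspace(0, 1, 10)]
print(spatial_integral(lambda a, b: 2.0, area=5.0, grid_points=points))
```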

A. Method and architecture
In order to produce the results that we report in the main text, we performed the following procedure for all three study regions: 1. Select non-overlapping training, validation, and test periods.
2. Train numerous models using the training period, and evaluate them on the validation period. The runs differ by the sizes of the layers in every encoder and in the FERN head model, by the regularization we apply to every layer, by the learning rate, and by other hyperparameters.
3. Select good hyperparameters using the validation set. Good hyperparameters give the best validation loss, while keeping the training and validation curves relatively smooth.
4. Retrain the model using the same hyperparameters on both the training and validation periods. Since there is randomness in the training, train 20 models using these hyperparameters, and select the one with the median train loss. Report the log-likelihood on the test period.
In practice, we can evaluate the log-likelihood of the test set with a finer spatial grid (taking a larger value for q) than the one used for training. Thus we get a more accurate log-likelihood score, while keeping the training relatively fast. The values for the log-likelihood that we report below are achieved with such an evaluation.
In the table below we specify all of the values that were chosen for our final reported results. We have 3 regions A, B, C (described in the main text), and 2 models per region, which we call FERN and FERN+. In FERN the input catalog only contains the target earthquakes. In FERN+ the input catalog also contains smaller earthquakes, with a magnitude threshold that is termed the 'feature magnitude'. Note that the set of examples e_{i,j} is the same for FERN and FERN+; the difference between them is in the richness of the encoding (i.e. they have different history objects).
All of the regularizations specified above are L2 regularizations, applied per encoder model or head model. The optimizer we use is Adam, with the learning rate that is specified in the table. Additionally, we decrease the learning rate by a multiplicative constant that is specified in the 'learning rate decay' row.
The following feature engineering functions were used:

IV. ONE-DAY FORECASTING
In order to demonstrate that the encoders learned some latent representation of the state of seismicity in the trained region, we want to reuse them for a different (related) task. The task we chose is 1-day forecasting. Specifically, given the history up to the end of one day, the model should output the expected number of earthquakes in spatial bins for the following day. The standard way to get such forecasts using ETAS is by sampling 100,000 simulated catalogs (per day) from the ETAS conditional intensity function, and then averaging the number of predicted earthquakes per cell. This task has the classical form of a machine learning regression problem, since we have labels - the actual number of earthquakes that occurred in every bin.
Here we choose to forecast in 0.5 × 0.5 deg² cells. Additionally, we do not have magnitude bins. So, our labels and forecasts can be denoted as Ω = {ω_{i,j} | i ∈ T, j ∈ S} and Λ = {λ_{i,j} | i ∈ T, j ∈ S} (respectively), where T is the set of indices of days, and S is the set of indices of grid cells.
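As an illustration of constructing the labels ω_{i,j}, the sketch below bins a catalog into daily counts per spatial cell; the function name and catalog layout are assumptions of the sketch.

```python
import numpy as np

def daily_bin_counts(catalog, lon_edges, lat_edges, day_edges):
    """omega_{i,j}: number of earthquakes per day i and spatial cell j.
    catalog columns: (lon, lat, t, magnitude); edges define the day and
    0.5 x 0.5 degree cell boundaries."""
    counts, _ = np.histogramdd(
        catalog[:, [2, 0, 1]],                   # reorder to (t, lon, lat)
        bins=(day_edges, lon_edges, lat_edges))
    n_days, nx, ny = counts.shape
    return counts.reshape(n_days, nx * ny)       # flatten the spatial cells
```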
We compare the performance of ETAS and FERN (and FERN+) by calculating common metrics - the L-test, S-test and N-test [7]. Note that these metrics are often evaluated by comparing the score that a forecast receives for the actual observed catalog against catalogs that were sampled from the forecast (assuming that the forecast is actually reporting the Poisson rate of earthquakes in every bin). This is a way of evaluating how good a forecast is without competing models. We prefer to compare between models, and therefore we do not do this simulation step. Instead, we calculate the different metrics for the three competing models, and compare them against each other. Additionally, we calculate the scores for two simple models - a 'Poisson' model, which learns a constant rate of earthquakes per spatial bin, and a 'Perfect' model, which always outputs the correct output (that is, λ^Perfect_{i,j} = ω_{i,j}). These models give us a scale for the difference between ETAS, FERN and FERN+.
• L-test: the log-likelihood of the observed catalog, under the assumption that the model is outputting Poisson rates per bin: L-test(Ω|Λ) = Σ_{i∈T, j∈S} (ω_{i,j} · log(λ_{i,j}) − λ_{i,j}).
• S-test: the log-likelihood of the spatial distribution of the observed catalog, under the assumption that the average output of the model per spatial bin is a Poisson rate. Define N_obs = Σ_{i∈T, j∈S} ω_{i,j} and N_fore = Σ_{i∈T, j∈S} λ_{i,j}, which are the total numbers of observed and forecasted earthquakes (respectively). Denote the spatial distributions Ω^S = {ω^S_j | j ∈ S} and Λ^S = {λ^S_j | j ∈ S}, where ω^S_j = Σ_{i∈T} ω_{i,j} and λ^S_j = (N_obs / N_fore) Σ_{i∈T} λ_{i,j}. Then the score is: S-test(Ω^S|Λ^S) = Σ_{j∈S} (ω^S_j · log(λ^S_j) − λ^S_j).
• N-test: the probability of observing the actual number of earthquakes, under the assumption that the model is outputting Poisson rates per bin. We can calculate the cumulative distribution of the number of earthquakes: F(x|N_fore) = e^{−N_fore} Σ_{n=0}^{⌊x⌋} N_fore^n / n!. The N-test metrics are the probabilities of observing at least N_obs earthquakes and at most N_obs earthquakes: δ_1 = 1 − F(N_obs − 1|N_fore) and δ_2 = F(N_obs|N_fore).
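A sketch of the three metrics, matching the formulas above (constant log ω! terms are dropped, and all forecast rates are assumed positive); the function names are ours:

```python
import numpy as np
from scipy.stats import poisson

def l_test(omega, lam):
    """Joint Poisson log-likelihood over all day/cell bins."""
    return np.sum(omega * np.log(lam) - lam)

def s_test(omega, lam):
    """Spatial log-likelihood: forecasts rescaled so their total matches
    the observed number of earthquakes, then summed over days."""
    n_obs, n_fore = omega.sum(), lam.sum()
    omega_s = omega.sum(axis=0)                  # per-cell observed counts
    lam_s = n_obs / n_fore * lam.sum(axis=0)     # rescaled per-cell forecast
    return np.sum(omega_s * np.log(lam_s) - lam_s)

def n_test(omega, lam):
    """delta_1, delta_2: tail probabilities of the observed total count under
    a Poisson distribution with mean equal to the total forecast count."""
    n_obs, n_fore = omega.sum(), lam.sum()
    delta1 = 1.0 - poisson.cdf(n_obs - 1, n_fore)
    delta2 = poisson.cdf(n_obs, n_fore)
    return delta1, delta2
```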
In all of our models (FERN and FERN+, for the three regions), we take the trained encoders and use them to calculate latent representations of the state of seismicity (by inputting the time of the end of a day, and the location of the center of a spatial bin). Then we train a simple fully connected model to map this latent representation to the Ω labels, optimizing for the standard Poisson loss (which is equivalent to optimizing for the L-test). We use the same four-step procedure that is described in Section III A in order to select the best model. The chosen hyperparameters appear in the table below.

Head regularization: 5e-7, 4e-7, 3e-8, 3e-7, 4e-8, 3e-8 (one value per model, FERN and FERN+ for each region)